Condensed Representations for Data Mining
Abstract
Condensed representations were proposed in Mannila and Toivonen (1996) as a useful concept for optimizing typical data-mining tasks. They are a key concept within the inductive database framework (Boulicaut et al., 1999; de Raedt, 2002; Imielinski & Mannila, 1996). This article introduces this research domain, its achievements in the context of frequent itemset mining (FIM) from transactional data, and its future trends. Within the inductive database framework, knowledge discovery processes are considered as querying processes. Inductive databases (IDBs) contain not only data but also patterns. In an IDB, ordinary queries are used to access and manipulate data, while inductive queries are used to generate (mine), manipulate, and apply patterns.
To motivate the need for condensed representations, let us start from the simple model proposed in Mannila and Toivonen (1997). Many data-mining tasks can be abstracted into the computation of a theory. Given a language L of patterns (e.g., itemsets), a database instance r (e.g., a transactional database), and a selection predicate q that specifies whether a given pattern is interesting (e.g., the itemset is frequent in r), a data-mining task can be formalized as the computation of Th(L,q,r) = {φ ∈ L | q(φ,r) is true}. This can also be regarded as the evaluation of the inductive query q. Notice that this definition requires every pattern that satisfies q to be computed. This completeness assumption is quite common for local pattern discovery tasks but is generally not acceptable for more complex tasks (e.g., accuracy optimization for predictive model mining). The selection predicate q can be defined as a Boolean expression over some primitive constraints (e.g., a minimal frequency constraint used in conjunction with a syntactic constraint that enforces the presence or absence of some subpatterns).
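The definition of Th(L,q,r) can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: L is taken to be the non-empty itemsets over three items, r is a hypothetical toy transactional database, and q combines a minimal frequency constraint with a syntactic constraint. The function names (`frequency`, `theory`, `q`) are invented for this example, and the enumeration is the naive generate-and-test strategy, not an efficient mining algorithm.

```python
from itertools import combinations


def frequency(itemset, transactions):
    """Evaluation function: number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def theory(items, transactions, q):
    """Naive computation of Th(L, q, r).

    L is the set of all non-empty itemsets over `items` and r is `transactions`.
    Every pattern of L is enumerated and tested against q (completeness).
    """
    result = []
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            phi = frozenset(combo)
            if q(phi, transactions):
                result.append(phi)
    return result


# Hypothetical toy database r: each transaction is a set of items.
r = [frozenset("abc"), frozenset("ab"), frozenset("ac"),
     frozenset("bc"), frozenset("abc")]
items = {"a", "b", "c"}


def q(phi, transactions):
    """Selection predicate: minimal frequency 3 AND the pattern must contain item 'a'."""
    return frequency(phi, transactions) >= 3 and "a" in phi


th = theory(items, r, q)
print(sorted("".join(sorted(phi)) for phi in th))
```

On this toy data the itemsets {a}, {a,b}, and {a,c} satisfy q, while {a,b,c} is rejected by the frequency constraint and {b}, {c}, {b,c} by the syntactic one. The loop visits all 2^n - 1 candidate itemsets, so the cost grows exponentially with the number of items.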
Some of the primitive constraints refer to the behavior of a pattern in the data by means of so-called evaluation functions (e.g., frequency). To support the whole knowledge discovery process, it is important to support the computation of many different but correlated theories. It is well known that a generate-and-test approach, which would enumerate the sentences of L and then test the selection predicate q on each of them, is generally intractable. Data-mining researchers have made a huge effort to exploit the primitive constraints occurring in q in order to achieve a tractable evaluation of useful mining queries. This is the domain of constraint-based mining (see, e.g., the seminal paper by Ng et al., 1998).
In real applications, the computation of Th(L,q,r) can remain extremely expensive or even impossible, and the framework of condensed representations has been designed to cope with such situations. The idea of ε-adequate representations was introduced in Mannila and Toivonen (1996) and Boulicaut and Bykowski (2000). Intuitively, they are alternative representations of the data that make it possible to answer a class of queries (e.g., frequency queries for itemsets in transactional data) with a bounded precision. At a given precision ε, one can be interested in the smaller representations, which are then called concise or condensed representations. This means that a condensed representation for Th(L,q,r) is a collection C ⊂ Th(L,q,r) such that every pattern from Th(L,q,r) can be derived efficiently from C. In the database-mining context, where r might contain a huge volume of records, we assume that efficiently means without further access to the data. The accompanying figure illustrates that we can compute Th(L,q,r) either directly (Arrow 1) or by means of a condensed representation (Arrow 2) followed by a regeneration phase (Arrow 3). We know several examples of condensed representations for which Phases 2 and 3 are much less expensive than Phase 1.
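The three arrows can be sketched for frequent itemsets as follows. This is a minimal illustration, assuming frequent closed itemsets as the condensed representation. Closed itemsets are a well-known exact case (ε = 0), but they are not named in the text above. The function names and the toy data are hypothetical. For clarity, the closed sets are obtained here by filtering the full collection, whereas practical algorithms compute them directly without going through Arrow 1.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(items, transactions, min_sup):
    """Arrow 1: direct computation of all frequent itemsets with their supports."""
    out = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            x = frozenset(combo)
            s = support(x, transactions)
            if s >= min_sup:
                out[x] = s
    return out


def closed_frequent_itemsets(items, transactions, min_sup):
    """Arrow 2 (naive): keep frequent itemsets with no proper superset of equal support."""
    freq = frequent_itemsets(items, transactions, min_sup)
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}


def regenerate(closed):
    """Arrow 3: rebuild all frequent itemsets and supports from `closed` alone.

    The support of an itemset is the largest support among the closed
    itemsets that contain it; the data are never accessed here.
    """
    out = {}
    for c, s in closed.items():
        for size in range(1, len(c) + 1):
            for combo in combinations(sorted(c), size):
                x = frozenset(combo)
                out[x] = max(out.get(x, 0), s)
    return out


# Hypothetical toy database r and minimal support threshold.
r = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("c")]
items = {"a", "b", "c"}

all_frequent = frequent_itemsets(items, r, min_sup=2)
condensed = closed_frequent_itemsets(items, r, min_sup=2)
rebuilt = regenerate(condensed)
print(len(all_frequent), len(condensed), rebuilt == all_frequent)
```

On this toy database, seven frequent itemsets are represented by three closed ones ({c}, {a,b}, and {a,b,c}), and the regeneration step recovers every support exactly without reading r again.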
We now introduce the background for understanding condensed representations in the well-studied context of FIM.
Similar resources
A Survey on Condensed Representations for Frequent Sets
Solving inductive queries which have to return complete collections of patterns satisfying a given predicate has been studied extensively the last few years. The specific problem of frequent set mining from potentially huge boolean matrices has given rise to tens of efficient solvers. Frequent sets are indeed useful for many data mining tasks, including the popular association rule mining task ...
Frequent closed itemsets based condensed representations for association rules
After more than one decade of researches on association rule mining, efficient and scalable techniques for the discovery of relevant association rules from large high-dimensional datasets are now available. Most initial studies have focused on the development of theoretical frameworks and efficient algorithms and data structures for association rule mining. However, many applications of associa...
Transaction Databases, Frequent Itemsets, and Their Condensed Representations
Mining frequent itemsets is a fundamental task in data mining. Unfortunately the number of frequent itemsets describing the data is often too large to comprehend. This problem has been attacked by condensed representations of frequent itemsets that are subcollections of frequent itemsets containing only the frequent itemsets that cannot be deduced from other frequent itemsets in the subcollecti...
Using Condensed Representations for Interactive Association Rule Mining
Association rule mining is a popular data mining task. It has an interactive and iterative nature, i.e., the user has to refine his mining queries until he is satisfied with the discovered patterns. To support such an interactive process, we propose to optimize sequences of queries by means of a cache that stores information from previous queries. Unlike related works, we use condensed represen...
Chaining Patterns
Finding condensed representations for pattern collections has been an active research topic in data mining recently and several representations have been proposed. In this paper we introduce chain partitions of partially ordered pattern collections as high-level condensed representations that can be applied to a wide variety of pattern collections including most known condensed representations ...